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Abstract 

We present scalable parallel algorithms with sublinear per-processor communication volume 
and low latency for several fundamental problems related to finding the most relevant elements in 
a set, for various notions of relevance: We begin with the classical selection problem with unsorted 
input. We present generalizations with locally sorted inputs, dynamic content (bulk-parallel 
priority queues), and multiple criteria. Then we move on to finding frequent objects and top-k 
sum aggregation. Since it is unavoidable that the output of these algorithms might be unevenly 
distributed over the processors, we also explain how to redistribute this data with minimal 
communication. 

Keywords: selection, frequent elements, sum aggregation, priority queue, sampling, branch- 
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1 Introduction 

Overheads due to communication latency and bandwidth limitations of the communication networks 
are one of the main limiting factors for distributed computing. Parallel algorithm theory has considered 
the latency part of this issue since its beginning. In particular, execution times poly logarithmic in 
the number p of processing elements (PEs) were a main focus. Borkar [9] argues that an exascale 
computer could only be cost-effective if the communication capabilities (bisection width) scale highly 
sublinearly with the amount of computation done in the connected subsystems. Google confirms 
that at data center scale, the network is the most scarce resource and state that they “don’t know 
how to build big networks that deliver lots of bandwidth” [36]. In a previous paper [34] we therefore 
proposed to look more intensively for algorithms that require bottleneck communication volume 
sublinear in the local input size. More precisely, consider an input consisting of n machine words 
distributed over the PEs such that each PE holds 0{n/p) words. Then sublinear communication 
volume means that no PE sends or receives more than o(n/p ) machine words of data. 

Here, we combine the latency and data volume aspects. We consider some fundamental algorithmic 
problems that have a large input (size n ) and a relatively small output (size k). Since potentially 
many similar problems have to be solved, both latency and communication volume have to be very 
small—ideally polylogarithmic in p and the input parameters. More precisely, we consider problems 
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that ask for the k most “relevant” results from a large set of possibilities, and aim to obtain low 
bottleneck communication overhead. 

In the simplest case, these are totally ordered elements and we ask for the k smallest of them—the 
classical selection problem. Several variants of this problem are studied in Section 4. For the 
classical variant with unsorted inputs, a careful analysis of a known algorithm [31] shows that the 
previously assumed random allocation of the inputs is actually not necessary and we get running 
time + logp). For locally sorted input we get latency 0(log 2 kp). Interestingly, we can return to 
logarithmic latency if we are willing to relax the output size k by a constant factor. This uses a new 
technique for obtaining a pivot element with a given rank that is much simpler than the previously 
proposed techniques based on sorting. 

A data structure generalization of the selection problem are bulk-parallel priority queues. Previous 
parallel priority queues are not communication efficient in the sense that they move the elements 
around, either by sorting [13] or by random allocation [31]. Section 5 generalizes the results on 
selection from Section 4. The key to making this work is to use an appropriately augmented search 
tree data structure to efficiently support insertion, deletion, and all the operations needed by the 
parallel selection algorithm. 

A prominent problem in information retrieval is to extract the k most relevant objects (e.g. 
documents), where relevance is determined by a monotonous function that maps several individual 
scoring functions to the overall relevance. For each individual score, a list of objects is precomputed 
that stores the objects in order of decreasing score. This is a multicriteria generalization of the sorted 
top-A; selection problem discussed in Section 4. In Section 6 we parallelize established sequential 
algorithms for this problem. The single-criterion selection algorithms are used as subroutines—the 
sorted one for approximating the list elements scanned by the sequential algorithm and the unsorted 
one to actually identify the output. The algorithm has poly logarithmic overhead for coordinating 
the PEs and manages to contain unavoidable load imbalance in a single phase of local computation. 
This is achieved with a fast estimator of the output size generated with a given number of scanned 
objects. 

A fundamental problem in data mining is finding the most frequently occurring objects. This is 
challenging in a distributed setting since the globally most frequent elements do not have to be locally 
frequent on any particular PE. In Section 7 we develop very fast sampling-based algorithms that find 
a (e, (f)-approximation or probably approximately correct answer, i.e., with probability at least 1 — 5 
the output is correct within en. The algorithms run in time logarithmic in n, p, and 1/5. From a 
simple algorithm with running time factor 1/e 2 we go to more sophisticated ones with factor 1/e. 
We also show how to compute the exact result with probability at least 1 — 5 if the elements are 
non-uniformly distributed. 

Subsequently, we generalize these results to sum aggregation, where object occurrences are 
associated with a value. In Section 8, we are thus looking for the objects whose values add up to the 
highest sums. 

All of the above algorithms have the unavoidable limitation that the output may be unevenly 
distributed over the PEs for general distributions of the input objects. This can lead to load imbalance 
affecting the efficiency of the overall application. Offsetting this imbalance will require communication 
so that one might argue that striving for communication efficiency in the selection process is in 
vain. However, our methods have several advantages over non-communication efficient selection 
algorithms. First of all, priorities can be ignored during redistribution since all selected elements are 
relevant. Hence, we can employ any data redistribution algorithm we want. In particular, we can 
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Table 1: Our main results. Parameters: input size n; output size k: number of PEs p\ startup 
overhead a; communication cost per word /3; relative error e; failure probability 5. Monitoring 
queries are marked with 


Problem 

Asymptotic running time 

old 

in our model O(-) 
new 

Unsorted Selection 

ft (/3p) +«log p [311 

p +f jmi n(Mogp’p) +«iogp 

Sorted Selection 

H(fclogn) (seq.) [38[ 

a log 2 kp a log kp (flexible k) 

Bulk Priority Queue 
insert* + deleteMin* 

log fe + «(p + lo gp) [31] 

a log 2 kp cr log kp (flexible k) 

Heavy Hitters 

0 + a ^ logn [19] t for 5 = const. 

cf. top -k most frequent objects 

Top-/c Most 

Frequent Objects 

Top-/c Sum 
Aggregation 

n (?+0‘+“l) pp 

P+ZSi^logJ+alogn 

^ + /?! yjp log j + ap (centralized) 

? + /3 i? f\/P»e! + “ lo S’* 

Multicriteria Top-A: 

not comparable 

log K ( m 2 log K + f3m + a log p) 


use an adaptive algorithm that only moves data if that is really necessary. In Section 9, we give one 
such algorithm that combines prefix sums and merging to minimize data movement and incurs only 
logarithmic additional delay. Delegating data movement to the responsibility of the application also 
has the advantage that we can exploit properties of the application that a general top -k selection 
algorithm cannot. For example, multicriteria selection algorithms as discussed in Section 6 are used 
in search engines processing a stream of many queries. Therefore, it is enough to do load balancing 
over a large number of queries and we can resolve persistent hot spots by adaptive data migration or 
replication. Another example are branch-and-bound applications [20,31]. By randomizing necessary 
data migrations, we can try to steer the system away from situations with bad data distribution (a 
proper analysis of such an algorithm is an open problem though). 

Some of our main results are listed in Table 1. Refer to Appendix B for proofs and further 
discussion. 

2 Preliminaries 

Our input usually consists of a multiset M of \M\ = n objects, each represented by a single machine 
word. If these objects are ordered, we assume that their total ordering is unique. This is without 
loss of generality since we can make the value v of object x unique by replacing it by the pair ( v , x) 
for tie breaking. 

Consider p processing elements (PEs) connected by a network. PEs are numbered 1 ..p. where a..b 
is a shorthand for {a,..., b} throughout this paper. Assuming that a PE can send and receive at 
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most one message at a time (full-duplex, single-ported communication), sending a message of size m 
machine words takes time a + m/3. We often treat the time to initiate a connection a and to send a 
single machine word (3 as variables in asymptotic analysis. A running time of 0(x + f3y + az) then 
allows us to discuss internal work x, communication volume y, and latency z separately. Depending 
on the situation, all three aspects may be important, and combining them into a single expression 
for running time allows us to specify them concisely. 

We present some of the algorithms using high level pseudocode in a single program multiple data 
(SPMD) style—the same algorithm runs on each PE, which perform work on their local data and 
communicate predominantly through collective operations. Variables are local by default. We can 
denote data on remote PEs using the notation x@i, which refers to the value of variable x on PE i. 
For example, x@i denotes the global sum of x over all PEs, which can be computed using a sum 
(all-)reduction. 

Collective communication. Broadcasting sends a message to all PEs. Reduction applies an 
associative operation (e.g., sum, maximum, or minimum) to a vector of length m. An all-reduction 
also broadcasts this result to all PEs. A prefix-sum (scan) computes x ®'i' on PE 3 wh ere ^ is a 
vector of length m. The scatter operation distributes a message of size m to a set of y PEs such 
that each of them gets a piece of size m/y. Symmetrically, a gather operations collects y pieces of 
size m/y on a single PE. All of these operations can be performed in time 0(/3m + alogp) [5,33]. 
In an all-to-all personalized communication (all-to-all for short) each PE sends one message of size m 
to every other PE. This can be done in time 0{ fir rip + ap) using direct point-to-point delivery or in 
time 0(/3mplogp + alogp) using indirect delivery [21, Theorem 3.24]. The all-to-all broadcast (aka 
gossiping) starts with a single message of size m on each PE and ends with all PEs having all these 
messages. This operation works in time 0{f3mp + alogp). 

Fast inefficient Sorting. Sorting O^^/p) objects can be done in time 0 (alogp) using a brute 
force algorithm performing all pairwise object comparisons in parallel (e.g., [2]). 

Search trees. Search trees can represent a sorted sequence of objects such that a variety of 
operations can be supported in logarithmic time. In particular, one can insert and remove objects 
and search for the next largest object given a key. The operation T.split(a:) splits T into two search 
trees Ti, T 2 such that T\ contains the objects of T with key < x and T 2 contains the objects of T 
with key > x. Similarly, if the keys in a tree Ti are smaller than the keys in T 2 then concat(Ti, T 2 ) 
returns a search tree representing the concatenation of the two sequences. If one additionally stores 
the size of subtrees, one can also support the select operation T[i], which returns the i-th largest 
object in T, and operation T. rank(x), which returns the number of objects in T of value at most x. 
For an example of such a data structure see [23, Chapter 7]. 

Chernoff bounds. We use the following Chernoff bounds [23,26] to bound the probability that a 
sum A = X\ + ... + X n of n independent indicator random variables deviates substantially from its 
expected value E [A]. For 0 < <p < 1, we have 


P [A < (1 - <p)E [A]] < e ~^ E[x] 

(1) 

P [A > (1 + <p)E [A]] < e -^ E[x] . 

(2) 
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— base case 


Function select(s : Sequence of Object; k : N) : Object 
if k = 1 then return min,- s[l]@z 
(£,r):— pickPivots(s) 

a\— (e € s : e < £) 
b:= (e&s:£<e<r) 

c:— (e E s : e > r) — partition 

n a -— Yli l a l@*; fib'-— —reduction 

if n a > k then return select (a, k) 

if n a + rib < k then return select(c, k — n a — rib) 

return select(fo, k — n a ) 


Algorithm 1: Communication efficient selection. 


Bernoulli sampling. A simple way to obtain a sample of a set M of objects is to select each 
object with probability p independent of the other objects. For small sampling probability p, the 
naive implementation can be significantly accelerated from time 0(\M\) to expected time 0(p\M\) 
by using skip values—a value of x indicates that the following x—1 elements are skipped and the x-th 
element is sampled. These skip values follow a geometric distribution with parameter p and can be 
generated in constant time. 

Distributed FR-Selection [31] Floyd and Rivest [16] developed a modification of quickselect 
using two pivots in order to achieve a number of comparisons that is close to the lower bound. 
This algorithm, FR-select, can be adapted to a distributed memory parallel setting [31]. FR-select 
picks the pivots based on sorting a random sample S of 0(y/p) objects, which can easily be done 
in logarithmic time (e.g., the simple algorithm described in [2] performs brute-force comparisons 
of all possible pairs of sample elements). Pivots £ and r are chosen as the sample objects with 
ranks k\S\ /n ± A where A = p 1/,4+<5 for some small constant <5 > 0. For the analysis, one can 
choose 5 = 1/6, which ensures that with high probability the range of possible values shrinks by a 
factor ©(p 1 / 3 ) in every level of recursion. 

It is shown that a constant number of recursion levels suffice if n = 0(plogp) and if the objects 
are distributed randomly. Note that the latter assumption limits communication efficiency of the 
algorithm since it requires moving all objects to random PEs for general inputs. 

Bulk Parallel Priority Queues. A natural way to parallelize priority queues is to use bulk 
operations. In particular, operation deleteMin* supports deletion of the k smallest objects of the 
queue. Such a data structure can be based on a heap where nodes store sorted sequences rather 
than objects [13]. However, an even faster and simpler randomized way is to use multiple sequential 
priority queues—one on each PE [31]. This data structure adopts the idea of Karp and Zhang [20] 
to give every PE a representative approximate view of the global situation by sending inserted 
objects to random queues. However, in contrast to Karp and Zhang [20], [31] implements an exact 
deleteMin* using parallel selection and a few further tricks. Note that the random insertions limit 
communication efficiency in this case. 
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3 More Related Work 


The fact that the amount of communication is important for parallel computing is a widely studied 
issue. However, there are very few results on achieving sublinear communication volume per processor. 
In part, this is because models like parallel external memory or resource-oblivious models assume 
that the input is located in a global memory so that processing it requires loading all of it into the 
local caches at least once. Refer to [34] for a more detailed discussion. 

A good framework for studying bottleneck communication volume is the BSP model [37] where a 
communication round with maximum amount of data per PE h (maximized over sent and received 
data) takes time l + hg for some machine parameters l (latency) and g (gap). Summing the values 
of h over all rounds yields the bottleneck communication volume we consider. We do not use the 
BSP model directly because we make heavy use of collective communication routines which are 
not directly represented in the BSP model. Further, the latency parameter t is typically equal 
to ap in real-world implementations of the BSP model (see also [2]). Also, we are not aware of 
any concrete results in the BSP model on sublinear communication volume in general or for top-A 
problems in particular. A related recent line of research considers algorithms based on MapReduce 
(e.g., [17]). Communication volume is an important issue there but it seems difficult to achieve 
sublinear communication volume since the MapReduce primitive does not model locality well. 

Recently, communication avoiding algorithms have become an important research direction for 
computations in high performance linear algebra and related problems, e.g. [11]. However, these 
results only apply to nested loops with array accesses following a data independent affine pattern 
and the proven bounds are most relevant for computations with work superlinear in the input size. 
We go into the opposite direction looking at problems with linear or even sublinear work. 

Selection. Plaxton [28] shows a superlogarithmic lower bound for selection on a large class of 
interconnection networks. This bound does not apply to our model, since we assume a more powerful 
network. 

Parallel Priority Queues. There has been intensive work on parallel priority queues. The most 
scalable solutions are our randomized priority queue [31] and Deo and Prasad’s parallel heap [13]. 
Refer to [39] for a recent overview of further approaches, most of which are for shared memory 
architectures and mostly operate on centralized data structures with limited scalability. 

Multicriteria, Most Frequent Objects. Considerable work has been done on distributed top-A: 
computations in wide area networks, sensor networks and for distributed streaming models [4,10,19, 
25,40]. However, these papers use a master-worker approach where all communication has to go over 
a master node. This implies up to p times higher communication volume compared to our results. 
Only rarely do randomized versions with better running times exist, but nevertheless, communication 
volume at the master node still increases with ^Jp [19]. Moreover, the scalability of TPUT [10] 
and KLEE [25] is severely limited by the requirement that the number of workers cannot exceed 
the number of criteria. Furthermore, the top-A’ most frequent objects problem has received little 
attention in distributed streaming algorithms. The only work we could find requires 0{Np ) working 
memory at the master node, where N = 0(n) is the number of distinct objects in the union of the 
streams [4]. The majority of papers instead considers the significantly easier problem of identifying 
the heavy hitters, i.e. those objects whose occurrences account for more than a fixed proportion of 
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the input, or frequency tracking, which tracks the approximate frequencies of all items, but requires 
an additional selection step to obtain the most frequent ones [19,40]. 

Sum Aggregation Much work has been done on aggregation in parallel and distributed settings, 
e.g. in the database community [12,22,27]. However, these papers all ask for exact results for all 
objects, not approximations for the k most important ones. We do not believe that exact queries are 
can be answered in a communication-efficient manner, nor could we find any works that consider the 
same problem, regardless of communication efficiency. 

Data redistribution using prefix sums is standard [8]. But combining it with merging allows 
adaptive communication volume as described in Section 9, which seems to be a new result. There are 
algorithms for minimizing communication cost of data distribution in arbitrary networks. However, 
these are much more expensive. Meyer auf der Heide et al. [24] use maximum flow computations. 
They also give an algorithm for meshes, which needs more steps than the lower bound though. 
Solving the diffusion equations minimizes the sum of the squares of the data to be moved in order 
to achieve perfect load balance [14]. However, solving this equation is expensive and for parallel 
computing we are more interested in the bottleneck communication volume. 

4 Selection 

We consider the problem of identifying the elements of rank < k from a set of n objects, distributed 
over p processors. First, we analyze the classical problem with unsorted input, before turning to 
locally sorted input in Sections 4.2 and 4.3. 

4.1 Unsorted Input 

Our first result is that using some minor adaptations and a more careful analysis, the parallel 
FR-algorithm from Section 2 does not actually need randomly distributed data. 

Theorem 1 Consider n elements distribided over p PE such that each PE holds 0{n/p) elements. 
The k globally smallest of these elements can be identified in expected time 

( Tl f Tl\ 

—b (3 min yfp log„ n, — + a log n 

p V p) 

Proof. (Outline) The algorithm from [31] computes the pivots based on a sample S of size 0(yfpY 
There, this is easy since the objects are randomly distributed in all levels of recursion. Here we 
can make no such assumption and rather assume Bernoulli sampling with probability y/p/^2i 
Although this can be done in time proportional to the local sample size, we have to be careful since 
(in some level of recursion) the input might be so skewed that most samples come from the same 
processor. Since the result on sorting only works when the input data is uniformly distributed over 
the PEs, we have to account additional time for spreading the samples over the PEs. 

Let x denote the PE which maximizes |s@x| in the current level of recursion in the algorithm 
from Figure 1. The total problem size shrinks by a factor of H(p 1//3 ) in each level of recursion with 
high probability. Hence, after a constant number of recursion levels, x is 0(n/p) as well. Moreover, 
after log^^i/ 3 ) ^ = 0( log p ^) further levels of recursion, the problem is solved completely with high 
probability. Overall, the total number of recursion levels is 0(1 + log p ^) = C4(log p n). 
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This already implies the claimed 0(n/p ) bound on internal computation - 0(1 ) times 0(n/p) 
until x has size 0(n/p) and a geometrically shrinking amount of work after that. 

Of course, the same bound 0((5 also applies to the communication volume. However, we also 
know that even if all samples come from a single PE, the communication volume in a single level of 
recursion is o(Vp)- 

Finally, the number of startups can be limited to 0(logp) per level of recursion by using a 
collective scatter operation on those PEs that have more than one sample object. This yields 
0(logplog p n) = O(logn). □ 


Corollary 2 If a and /? are viewed as constants, the bound from Theorem 1 reduces to O^ + logpj. 


The bound from Theorem 1 immediately reduces to Oy^ + log nj for constant a and j3. We can 
further simplify this by observing that when the logn term dominates n/p, then logn = 0{\ogp). 


4.2 Locally Sorted Input 

Selection on locally sorted input is easier than the unsorted problem from Section 4.1 since we only 
have to consider the locally smallest objects and are able to locate keys in logarithmic time. Indeed, 
this problem has been studied as the multisequence selection problem [35,38]. 

In [2], we propose a particularly simple and intuitive method based on an adaptation of the 
well-known quickselect algorithm [18, 23]. For self-containedness, we also give that algorithm 
in Appendix A, adapting it to the need in the present paper. This algorithm needs running 
time O^a log 2 kp ) and has been on the slides of Sanders’ lecture on parallel algorithms since 2008 [32], 
Algorithm 9 can be viewed as a step backwards compared to our algorithm from Section 4.1 as using 
a single random pivot forces us to do a deeper recursion. We do that because it makes it easy to 
tolerate unbalanced input sizes in the deeper recursion levels. However, it is possible to reduce the 
recursion depth to some extent by choosing pivots more carefully and by using multiple pivots at 
once. We study this below for a variant of the selection problem where we are flexible about the 
number k of objects to be selected. 


4.3 Flexible k , Locally Sorted Input 

The O (log 2 pk) startups incurred in Section 4.2 can be reduced to Oflogpk) if we are willing to 
give up some control over the number of objects that are actually returned. We now give two input 
parameters k and k and the algorithm returns the k smallest objects so that k < k < k. We begin 
with a simple algorithm that runs in logarithmic time if k — k = Tl f k ) and then explain how to 
refine that for the case k — k = o(k). 

The basic idea for the simple algorithm is to take a Bernoulli sample of the input using a success 
probability of 1/x, for x G k..k. Then, the expected rank of the smallest sample object is x , i.e., 
we have a truthful estimator for an object with the desired rank. Moreover, this object can be 
computed efficiently if working with locally sorted data: the local rank of the smallest local sample 
is geometrically distributed with parameter 1/x. Such a number can be generated in constant time. 
By computing the global minimum of these locally smallest samples, we can get the globally smallest 
sample v in time 0(a\ogp). We can also count the exact number k of input objects bounded by this 
estimate in time 0(\ogk + a log p) —we locate v in each local data set in time 0(logk) and then 


sum the found positions. If k E k..k, we are done. Otherwise, we can use the acquired information 
as in any variant of quickselect. 

At least in the recursive calls, it can happen that k is close to the total input size n. Then it 
is a better strategy to use a dual algorithm based on computing a global maximum of a Bernoulli 
sample. In Algorithm 2 we give pseudocode for a combined algorithm dynamically choosing between 
these two cases. It is an interesting problem which value should be chosen for x. The formula used 
in Algorithm 2 maximizes the probability that k E k..k. This value is close to the arithmetic mean 
when k/k~l but it is significantly smaller otherwise. The reason for this asymmetry is that larger 
sampling rates decrease the variance of the geometric distribution. 

Theorem 3 If k — k = 0, (Jfj, then Algorithm amsSelect from Figure 2 finds the k smallest elements 
with k E k..k in expected time Offogk + alogp). 

Proof. (Outline) One level of recursion takes time 0{a\ogp) for collective communication opera¬ 
tions (min, max, or sum reduction) and time 0(\ogk ) for locating the pivot v. It remains to show 
that the expected recursion depth is constant. 

We actually analyze a weaker algorithm that keeps retrying with the same parameters rather than 
using recursion and that uses probe x = k. We show that, nevertheless, there is a constant success 
probability (i.e., k<k<k with constant probability). The rank of v is geometrically distributed 
with parameter \/x. The success probability becomes 


E 


l\ fc_1 1 

k) k 


k 

k- 1 




which is a positive constant if k — k = (k ). 


□ 


Multiple Concurrent Trials. The running time of Algorithm 2 is dominated by the logarithmic 
number of startup overheads for the two reduction operations it uses. We can exploit that reductions 
can process long vectors using little additional time. The idea is to take d Bernoulli samples of the 
input and to compute d estimates for an object of rank x. If any of these estimates turns out to have 
exact rank between k and k, the recursion can be stopped. Otherwise, we solve a recursive instance 
consisting of those objects enclosed by the largest underestimate and the smallest overestimate found. 

Theorem 4 If k-k = n(k/d), an algorithm, processing batches of d Bernoulli samples can be 
implemented to run in expected time 0(d log k + (3d + a logp). 

Proof. (Outline) A single level of recursion runs in time O^dlogk + (3d + a log p). Analogous to 
the proof of Theorem 3, it can be shown that the success probability is (1 /d) for a single sample. 
This implies that the probability that any of the d independent samples is successful is constant. □ 

For example, setting d = O(logp), we obtain communication time O(alogp) and 0{\ogk\ogp) 
time for internal work. 
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— min-based estimator 


Function amsSelect(s : Object []; k : N; k : N) : Object [] 
if k < n — k then 

x geometricRandomDeviate ^1 — k-k+i^ 

if x > |s| then v:— oo else v:= s[x] 
v:= mini<j< p v@i 

else 

x\— geometricRandomDeviate ^1 — *~- +1 ^ 

if x > |s| then v:= — oo else v:— s[|s| — x + 1] 
v := maxi<j< p i;@* 

find j such that s[l..j] < v and s[j + 1..] > v 

k 'Ei<i< P \j@i\ 

if k < k then 

return s[l..j] U amsSelect(s[j + l..|s|], k — k , k — k,n — k ) 

if k > k then 

return amsSelect(s[l..j], k, k, k) 
return s[l..j] 


— minimum reduction 
— max-based estimator 


— maximum reduction 
— sum all-reduction 


Algorithm 2: Approximate multisequence selection. Select the k globally smallest out of n objects 
with k < k < k. Function geometricRandomDeviate is a standard library function for generating 
geometrically distributed random variables in constant time [29]. 

5 Bulk Parallel Priority Queues 

We build a global bulk-parallel priority queue from local sequential priority queues as in [20,31], 
but never actually move elements. This immediately implies that insertions simply go to the local 
queue and thus require only O(logn) time without any communication. Of course, this complicates 
operation deleteMin*. The number of elements to be retrieved from the individual local queues can 
vary arbitrarily and the set of elements stored locally is not at all representative for the global content 
of the queue. We can not even afford to individually remove the objects output by deleteMin* from 
their local queues. 

We therefore replace the ordinary priority queues used in [31] by search tree data structures that 
support insertion, deletion, selection, ranking, splitting and concatenation of objects in logarithmic 
time (see also Section 2). To become independent of the actual tree size of up to n, we furthermore 
augment the trees with two arrays storing the path to the smallest and largest object respectively. 
This way, all required operations can be implemented to run in time O(logfc) rather than O(logn). 

Operation deleteMin* now becomes very similar to the multi-sequence selection algorithms from 
Section 4.3 and Appendix A. The only difference is that instead of sorted arrays, we are now working 
on search trees. Tliis implies that selecting a local object with specified local rank (during sampling) 
now takes time 0(log k) rather than constant time. However, asymptotically, this makes no difference 
since for any such selection step, we also perform a ranking step, which takes time Oilogk) anyway 
in both representations. 
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One way to implement the recursion in the selection algorithms is via splitting. Since the split 
operation is destructive, after returning from the recursion, we have to reassemble the previous state 
using concatenation. Another way that might be faster and simpler in practice is to represent a 
subsequence of s by s itself plus cursor information specifying rank and key of the first and last 
object of the subsequence. 

Now, we obtain the following by applying our results on selection from Section 4. 

Theorem 5 Operation deleteMin* can be implemented to run in the following expected times. With 
fixed batch size k, expected time 0(a log 2 kp ) suffices. For flexible batch size in k..k, where k — k = 
17 (/c), we need expected time 0(alogkp). Ifk — k = 12 ( k/d ), expected time 0(d\ogk + fid + alogp) 
is sufficient. 

Note that flexible batch sizes might be adequate for many applications. For example, the parallel 
branch-and-bound algorithm from [31] can easily be adapted: In iteration i of its main loop, it 
deletes the smallest ki = 0(p) elements (tree nodes) from the queue, expands these nodes in parallel, 
and inserts newly generated elements (child nodes of the processed nodes). Let K = ki denote 
the total number of nodes expanded by the parallel algorithm. One can easily generalize the proof 
from [31] to show that K = m. + 0(hp) where m is the number of nodes expanded by a sequential best 
first algorithm and h is the length of the path from the root to the optimal solution. Also note that 
a typical branch-and-bound computation will insert significantly more nodes than it removes—the 
remaining queue is discarded after the optimal solutions are found. Hence, the local insertions of 
our communication efficient queue are a big advantage over previous algorithms, which move all 
nodes [20,31]. 

6 Multicriteria Top -k 

In the sequential setting, we consider the following problem: Consider m lists Li of scores that 
are sorted in decreasing order. Overall relevance of an object is determined by a scoring func¬ 
tion t(x i,..., x m ) that is monotonous in all its parameters. For example, there could be m keywords 
for a disjunctive query to a fulltext search engine, and for each pair of a keyword and an object, it is 
known how relevant this keyword is for this object. Many algorithms used for this setting are based 
on Fagin’s threshold algorithm [15]. The lists are partially scanned and the algorithm maintains 
lower bounds for the relevance of the scanned objects as well as upper bounds for the unscanned 
objects. Bounds for a scanned object x can be tightened by retrieving a score value for a dimension 
for which x has not been scanned yet. These random accesses are more expensive than scanning, in 
particular if the lists are stored on disk. Once the top-A scanned objects have better lower bounds 
than the best upper bound of an unscanned object, no more scanning is necessary. For determining 
the exact relevance of the top-A scanned objects, further random accesses may be required. Various 
strategies for scanning and random access yield a variety of variants of the threshold algorithm [6]. 

The original threshold algorithm works as follows [15]: In each of K iterations of the main 
loop, scan one object from each list and determine its exact score using random accesses. Let Xi 
denote the smallest score of a scanned object in L,. Once at least k scanned objects have score at 
least t{x i,..., x m ), stop, and output them. 

We consider a distributed setting where each PE has a subset of the objects and m sorted lists 
ranking its locally present objects. We describe communication efficient distributed algorithms that 
approximate the original threshold algorithm (TA) [15]. First, we give a simple algorithm assuming 
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Procedure DTA((Li,..., L m ), t, k ) 


K:= 

do 


k 

mp 


for i:= 1 to m do 

L\ amsSelect(Lj, K, 2K, \Li\), (-,Xi):= L'.last 

Amin*— ■ j %rn) 

for i:= 1 tom 
Ri, Hi\= 0 

for j:= 1 to y = (9(log K) 

sample an element x = ( o , s) from L[ 

if ■ (o, s') € Lj then R. 

elsif t(o) > t m in then H t 

£i-.= IL'I (l- R) ■ R 
1 1 ll V y J y 

H:= £j=i (E*=i *i) @3 

K:= 2 K 

until H > 2k 

return (t min , (L \,.. .,L^)) 


— lower bound for A' 


— Xi min selected score 
— threshold 
— each PE locally 


— ignore samples contained in other L'- 
— t(o) does lookups in all Lj to compute score 
— truthful estimator of =Lhits for L' on this PE 

— exponential search 
— =^> k hits found whp 


Algorithm 3: Multicriteria Top-fc for arbitrary data distribution (DTA) on Lists Li,..., L m with 
scoring function t 


random data distribution of the input objects (RDTA) and then describe a more complex algorithm 
for arbitrary data distribution (DTA). 

Random Data Distribution. Since the data placement is independent of the relevance of the 
objects, the top-A; objects are randomly distributed over the PEs. Well known balls-into-bins bounds 
give us tight high probability bounds on the maximal number k of top-A: objects on each PE [30]. 
Here, we work with the simple bound k = (9(| +logp). The RDTA algorithm simply runs TA locally 
to retrieve the k locally most relevant objects on each PE as result candidates. It then computes a 
global threshold as the maximum of the local thresholds and verifies whether at least k candidate 
objects are above this global threshold. In the positive case, the k most relevant candidates are found 
using the selection algorithm from [31]. Otherwise, k is increased and the algorithm is restarted. In 
these subsequent iterations, PEs whose local threshold is worse than the relevance of the A:-th best 
element seen so far do not need to continue scanning. Hence, we can also trade less local work for 
more rounds of communication by deliberately working with an insufficiently small value of k. 

Arbitrary Data Distribution. We give pseudocode for Algorithm DTA in Figure 3. 

Theorem 6 Algorithm DTA requires expected time 0(m 2 log 2 K + [3m log K + a log p log A') to 
identify a set of at most O(K) objects that contains the set of K objects scanned by TA. 

Proof. (Outline) Algorithm DTA “guesses” the number K of list rows scanned by TA using expo¬ 
nential search. This yields 0(logK) rounds of DTA. In each round, the approximate multisequence 
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selection algorithm from Section 4.3 is used to approximate the globally K -th largest score Xi 
in each list. Define L[ as the prefix of A* containing all objects of score at least Xi. This takes 
time 0(log K + alogp) in expectation by Theorem 3. Accumulating the searches for all m lists 
yields 0(mlogK + (3m + alogp) time. Call an object selected by the selection algorithm a hit if 
its relevance is above the threshold t(x i,..., x m ). DTA estimates the number of hits using sampling. 
For each PE and list i separately, y = (D(log A) objects are sampled from A'. Each sample takes 
time 0(m ) for evaluating t(-). Multiplying this with rn lists and log A' iterations gives a work bound 
of m 2 log K. Note that this is sublinear in the work done by the sequential algorithm which takes 
time m?K to scan the A most important objects in every list. To eliminate bias for objects selected 
in multiple lists L DTA only counts an object x if it is sampled in the first list that contains it. 
Otherwise, x is rejected. DTA also counts the number of rejected samples R. Let H denote the 
number of nonrejected hits in the sample. Then, lk= |A'| (1 — R/y ) is a truthful estimate for the 
length of the list with eliminated duplicates and is a truthful estimate for the number of hits for 
the considered list and PE. Summing these rnp values using a reduction operation yields a truthful 
estimate for the overall number of hits. DTA stops once this estimate is large enough such that 
with high probability the actual number of hits is at least k. The output of DTA are the prefixes 

..., L' m of the lists on each PE, the union of which contains the k most relevant objects with 
high probability. It also outputs the threshold t(x i,... ,x m ). In total, all of this combined takes 
expected time 0(m 2 log 2 K + /An log A' + alogplog A"). □ 

Actually computing the k most frequent objects amounts to scanning the result lists to find all 
hits and, if desired, running a selection algorithm to identify the k most relevant among them. The 
scanning step may involve some load imbalance, as in the worst case, all hits are concentrated on a 
single PE. This seems unavoidable unless one requires random distribution of the objects. However, 
in practice it may be possible to equalize the imbalance over a large number of concurrently executed 
queries. 

Refinements. We can further reduce the latency of DTA by trying several values of K in each 
iteration of algorithm DTA. Since this involves access to only few objects, the overhead in internal 
work will be limited. In practice, it may also be possible to make good guesses on AT based on 
previous executions of the algorithm. 

7 Top -k Most Frequent Objects 

We describe two probably approximately correct (PAC) algorithms to compute the top-fc most frequent 
objects of a multiset M with \M\ = n, followed by a probably exactly correct (PEC) algorithm for 
suitable inputs. Sublinear communication is achieved by transmitting only a small random sample of 
the input. For bound readability, we assume that M is distributed over the p PEs so that none has 
more than 0(n/p ) objects. This is not restricting, as the algorithms’ running times scale linearly 
with the maximum fraction of the input concentrated at one PE. 

We express the algorithms’ error relative to the total input size. This is reasonable—consider a 
large input where k objects occur twice, and all others once. If we expressed the error relative to the 
objects’ frequencies, this input would be infeasible without communicating all elements. Thus, we 
refer to the algorithms’ relative error as i. defined so that the absolute error in is the count of the 
most frequent object that was not output minus that of the least frequent object that was output, 
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Figure 4: Example for the PAC top-k most frequent objects algorithm from Section 7.1. Top: 
input distributed over 4 PEs, sampled elements (p = 0.3) highlighted in red. Bottom: distributed 
hash table counting the samples. The k = 5 most frequent objects, highlighted in red, are to be 
returned after scaling with 1/p. Estimated result: ( E , 17), (A, 13), (T, 10), (/, 10), (0,10). Exact 
result: (E, 16), (A, 10), (T, 10), (I, 9), ( D , 8). Object D was missed, instead O (count 7) was returned. 
Thus, the algorithm’s error is 8 — 7 = 1 here. 


or 0 if the result was exact. Let 5 limit the probability that the algorithms exceeds bound e, i.e. 
P [e > e] <5. We refer to the result as an (e, ^-approximation. 

7.1 Basic Approximation Algorithm 

First, we take a Bernoulli sample of the input. Sampling is done locally. The frequencies of the 
sampled objects are counted using distributed hashing—a local object count with key k is sent to 
PE h(k ) for a hash function h that we here expect to behave like a random function. We then select 
the k most frequently sampled objects using the unsorted selection algorithm from Section 4.1. An 
example is illustrated in Figure 4. 

Theorem 7 Algorithm PAC can be implemented to compute an (e, 5)-approximation of the top-k 
most frequent objects in expected time 0(/3^r log § + alogn). 

Lemma 8 For sampling probability p, Algorithm PAC runs in O + j3y logp + alogn^ time in 
expectation. 

Proof. (Outline) Bernoulli sampling is done in expected time 0(np/p) by generating skip values 
with a geometric distribution using success probability p. Since the number of sample elements in a 
Bernoulli sample is a random variable, so is the running time. To count the sampled objects, each 
PE aggregates its local samples in a hash table, counting the occurrence of each sample object during 
the sampling process. It inserts its share of the sample into a distributed hash table [34] whose hash 
function we assume to behave like a random function, thus distributing the objects randomly among 
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the PEs. The elements are communicated using indirect delivery to maintain logarithmic latency. 
This requires logp + a log p) time in expectation. To minimize worst-case communication 

volume, the incoming sample counts are merged with a hash table in each step of the reduction. 
Thus, each PE receives at most one message per object assigned to it by the hash function. 

From this hash table, we select the object with rank k using Algorithm 1 in expected time 
0( r ^ + (3^ + alognp). This pivot is broadcast to all PEs, which then determine their subset of 
at least as frequent sample objects in expected time 0(J^ + a log p). These elements are returned. 
Overall, the claimed time complexity follows using the estimate p < n and np < n in order to 
simplify the a-term. □ 

Lemma 9 Let p e (0,1) be the sampling probability. Then, Algorithm PAC ’s error e exceeds a value 
of A/n with probability 

_ A ^ pk _ A A 

P [in > A] < 2 ne ia« + ke s™ . 

Proof. We bound the error probability as follows: the probability that the error i exceeds some 
value A/n is at most n times the probability that a single object’s value estimate deviates from its 
true value by more than A/2 in either direction. This probability can be bounded using Chernoff 
bounds. We denote the count of element j in the input by Xj and in the sample by Sj. Further, 
let Fj be the probability that the error for element j exceeds A/2 in either direction. Let j > k, 
and observe that Xj < n/k. Using Equations 1 and 2 with X = Sj, E [[] X] = pxj, and <p = we 
obtain: 


r ( a\i 


r ( a\ i 

Sj <p\ Xj --\ 

+p 

Sj >p[Xj + -J 


A ^ pk A^ pk A ^ pk 

< e + e < 2e 


This leaves us with the most frequent k — 1 elements, whose counts can be bounded as Xj < 

n. As overestimating them is not a concern, we apply the Chernoff bound in Equation 1 and 

r / am a^ p _ 

obtain P I Sj < p (xj — I < . In sum, all error probabilities add up to the claimed value. □ 


We bound this result by an error probability <5. This allows us to calculate the minimum required 
sample size given 6 and A = en. Solving the above equation for p yields 


4 /3 4n 2 k\ 

max \k 11 T’ 2 n T/ ’ 


(3) 


which is dominated by the latter term in most cases and yields pn > In W- for the expected sample 
size. 

Proof. (Theorem 7/ Equation 3 yields pn = log |). The claimed running time bound then 
follows from Lemma 8. □ 


Note that if we can afford to aggregate the local input, we can also use the Sum Aggregation 
algorithm from Section 8 and associate a value of 1 with each object. 
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7.2 Increasing Communication Efficiency 

Sample sizes proportional to 1/e 2 quickly become unacceptably large as e decreases. To remedy 
this, we iterate over the local input a second time and count the most frequently sampled objects’ 
occurrences exactly. This allows us to reduce the sample size and improve communication efficiency 
at the cost of increased local computation. We call this Algorithm EC for exact counting. Again, 
we begin by taking a Bernoulli sample. Then we find the k* > k globally most frequent objects in 
the sample using the unsorted selection algorithm from Section 4.1, and count their frequency in 
the overall input exactly. The identity of these objects is broadcast to all PEs using an all-gather 
(gossiping, all-to-all broadcast) collective communication operation. After local counting, a global 
reduction sums up the local counts to exact global values. The k most frequent of these are then 
returned. 

Lemma 10 When counting the k' most frequently sampled objects’ occurrences exactly, a sample 
size of np = In j suffices to ensure that the result of Algorithm EC is an (e, S)-approximation of 
the top-k most freqiLent objects. 

Proof. With the given error bounds, we can derive the required sampling probability p similar to 
Lemma 9. However, we need not consider overestimation of the k* most frequent objects, as their 
counts are known exactly. We can also allow the full margin of error towards underestimating their 
frequency (A instead of A/2) and can ignore overestimation. This way, we obtain a total expected 
sample size of pn > In ^. □ 

We can now c alculate th e value of k! that minimizes the total communication volume and obtain 
k* = max(&, e \J 2 ^°p P ~ l n f )• Substituting this into the sample size equation of Lemma 12 and 
adapting the running time bound of Lemma 8 then yields the following theorem. 

Theorem 11 Algorithm EC can be implemented to compute an (e, 5) -approximation of the top-k 
most frequent objects in o(^ + • log j -falogn^ time in expectation. 

Proof. Sampling and hashing are done as in Algorithm PAC (Section 7.1). We select the object 
with rank k* = max(fc, ^ ' 21 ° gp j n g ) as a pivot. This requires expected time + (3^ + a log np) 
using Algorithm 1. The pivot is broadcast to all PEs, which then determine their subset of at least 
as frequent sample objects using +alogp) time in expectation. Next, these k* most frequently 

sampled objects are distributed to all PEs using an all-gather operation in time 0(/3k* + alogp). 
Now, the PEs count the received objects’ occurrences in their local input, which takes 0(n/p ) time. 
These counts are summed up using a vector-valued reduction, again requiring 0(/3k* + a log p) 
time. We then apply Algorithm 1 a second time to determine the k most frequent of these objects. 
Overall, the claimed time complexity follows by substituting k* for k' in the sampling probability 
from Lemma 10. □ 

Substituting k* from before then yields a total required sample size of pn = j ■ \J ln~^ for 
Algorithm EC. Note that this term grows with ^ instead of 1/e 2 , reducing per-PE communication 

volume to O log f) words. 

To continue the example from Figure 4, we may set k* = 8. Then, the k* most frequently 
sampled objects (E, A, T, I, O, D , H, S ) with (16,10,10, 9, 7, 8, 7, 6) occurrences, respectively, will be 
counted exactly. The result would now be correct. 
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Figure 5: Example of a distribution with a gap. The dashed lines indicate the upper and lower 
high-probability bounds on Xi estimated from the sample 


Note that Algorithm EC can be less communication efficient than Algorithm PAC if e is large, i.e. 
the result is very approximate. Then, k* can be prohibitively large, and the necessity to communicate 
the identity of the objects to be counted exactly, requiring time 0(/3k* + alogp), can cause a loss 
in communication efficiency. 


7.3 Probably Exactly Correct Algorithm 

If any significant gap exists in the frequency distribution of the objects (see Figure 5 for an example), 
we perform exact counting on all likely relevant objects, determined from their sample count. Thus, 
choose k* to ensure that the top-k most frequent objects in the input are among the top-A:* most 
frequent objects in the sample with probability at least 1 — 5. 

A probably exactly correct (PEC) algorithm to compute the top -k most frequent objects of a 
multiset whose frequency distribution is sufficiently sloped can therefore be formulated as follows. 
Take a small sample with sampling probability po, the value of which we will consider later. From 
this small sample, we deduce the required value of k* to fulfill the above requirements. Now, we 
apply Algorithm EC using this value of k*. Let Xi be the object of rank i in the input, and Sj that 
of rank j in the small sample. 


Lemma 12 It suffices to choose k* in such a way that s — y2po£fchi| = E [4] — 
\/2E [4] In | to ensure correctness of Algorithm PEC ’s result with probability at least 1 — 5. 


Proof. We use the Chernoff bound from Equation 1 to bound the probability of a top -k object not 
being among the top-A;* objects of the sample. Define s(xi) as the number of samples of the object 
with input rank i in the first sample, and x(§j) to be the exact number of occurrences in the input 
of the object with rank j in the first sample. Using X = s(xk), E [[] X] = poXk, and tp = 1 — 
we obtain 


^2 P ^04') - Sfc*] < k P [s(x fc ) < 4*] 
j=i 


< 


k ' C V p 0 x k J 


PO x k 
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We bound this value by the algorithm’s error probability 5 and solve for s*.*, which yields the claimed 
value. □ 

This only works for sufficiently sloped distributions, as otherwise k* 3> k would be necessary. 
Furthermore, it is clear that the choice of po presents a trade-off between the time and communication 
spent on the first sample and the exactness of k*, which we have to estimate more conservatively if 
the first sample is small. This is due to less precise estimations of E [.§/.] = p^Xk if po is small. To 
keep things simple, we can choose a relative error bound £o and use the sample size from our PAC 
algorithm of Theorem 7. The value of £o—and thus po —that minimizes the communication volume 
depends on the distribution of the input data. 

Theorem 13 If the value of k* computed from Lemma 12 satisfies k* = 0(k), then Algorithm PEC 
requires time asymptotically equal to the sum of the running times of algorithms PAC and EC from 
Theorems 7 and 11. 

Proof. (Outline) In the first sampling step, we are free to choose an arbitrary relative error 
tolerance £q. The running time of this stage is 0(/3^fjr log j + alogn) by Theorem 7. We then 

estimate k* by substituting the high-probability bound E [[] §k] > Sfc — y/2§k ln(l/<5) (whp) for its 
expected value in Lemma 12 (note that as sk increases with growing sample size and thus grows 
with falling £q, the precision of this bound increases). In the second stage, we can calculate the 
required value of £ from k* by solving the expression for k* in the proof of Theorem 11 for £, and 

obtain s = -^ y/ 21 ° gp log j. Since k* = 0(k), the second stage’s running time is as in Theorem 11. 
In sum, the algorithm requires the claimed running time. □ 


Zipf’s Law In its simplest form, Zipf’s Law states that the frequency of an object from a multiset M 
with \M\ = n is inversely proportional to its rank among the objects ordered by frequency. Here, 
we consider the general case with exponent parameter s, i.e. Xj = njj —, where H ns = i~ s is 
the n-th generalized harmonic number. 


Theorem 14 For inputs distributed according to Zipf’s law with exponent s, a sample size of 
pn = 4 k s H U)S In | is sufficient to compute a probably exactly correct result. Algorithm PEC then 

requires expected time H n)S log | + aTogn^ to compute the k most frequent objects 

with probability at least 1 — 5. 


Proof. Knowing the value distribution, we do not need to take the first sample. Instead, we can 
calculate the expected value of k* directly from the proof of Lemma 12 and obtain 


E [k*] = k 



This immediately yields p > In |, and we choose p = In | = 4 - In |, i.e. twice the required 

minimum value. This gives us the claimed sample size. Now, we obtain E [k*] = k (2 + y/2)* ~ 
V^3A1 k. In particular, k* is only a constant factor away from k. Plugging the above sampling 
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probability into the running time formula for our algorithm using exact counting, we obtain the 
exact top-A; most frequent objects with probability at least 1 — <5 in the claimed running time. □ 

Note that the number of frequent objects decreases sharply with s > 1, as the fc-tli most frequent 
one has a relative frequency of only 0(k~ s ). The k s H n ^ s term in the communication volume is thus 
small in practice, and, in fact, unavoidable in a sampling-based algorithm. (One can easily verify 
this claim by observing that this factor is the reciprocal of the k- th most frequent object’s relative 
frequency. It is clear that this object needs at least one occurrence in the sample for the algorithm 
to be able to find it, and that the sample size must thus scale with k s H n:S .) 

7.4 Refinements 

When implementing such an algorithm, there are a number of considerations to be taken into account 
to achieve optimal performance. Perhaps most importantly, one should apply local aggregation when 
inserting the sample into the distributed hash table to reduce the amount of information that needs 
to be communicated in practice. We now discuss other potential improvements. 

Choice of k*. In practice, the choice of k* in Section 7.2 depends on the communication channel’s 
characteristics /?, and, to a lesser extent, a , in addition to the problem parameters. Thus, an 
optimized implementation should take them into account when determining the number of objects 
to be counted exactly. 

Adaptive Two-Pass Sampling (Outline). The objectives of the basic PAC algorithm and its 
variant using exact counting could be unified as follows: we sample in two passes. In the first pass, we 
use a small sample size npo = fo(n, k, e, A) to determine the nature of the input distribution. From 
the insights gained from this first sample, we compute a larger sample size np\ = f\ (n. k, cf/., £, A). 
We then determine and return the k most frequent objects of this second sample. 

Additionally, we can further refine this algorithm if we can already tell from the first sample that 
with high probability, there is no slope. If the absolute counts of the objects in the sample are large 
enough to return the k most frequent objects in the sample with confidence, then taking a second 
sample would be of little benefit and we can return the k most frequent objects from the first sample. 

Using Distributed Bloom Filters. Communication efficiency of the algorithm using exact 
counting could be improved further by counting sample elements with a distributed single-shot bloom 
filter (dSBF) [34] instead of a distributed hash table. We transmit their hash values and locally 
aggregated counts. As multiple keys might be assigned the same hash value, we need to determine 
the element of rank k* + n instead of k*, for some safety margin k > 0. We request the keys of all 
elements with higher rank, and replace the (hash, value) pairs with (key, value) pairs, splitting them 
where hash collisions occurred. We now determine the element of rank k* on the k* + k + ^collisions 
elements. If an element whose rank is at most k* was part of the original k* + k elements, we are 
finished. Otherwise, we have to increase k to determine the missing elements. Observe that if the 
frequent objects are dominated by hash collisions, this implies that the input distribution is flat and 
there exist a large number of nearly equally frequent elements. Thus, we may not need to count 
additional elements in this case. 
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8 Top-A; Sum Aggregation 


Generalizing from Section 7, we now consider an input multiset of keys with associated non-negative 
counts, and ask for the k keys whose counts have the largest sums. Again, the input M is distributed 
over the p PEs so that no PE has more than 0(n/p) objects. Define m := v as the sum of 

all counts, and let no PE’s local sum exceed 0{m/p) 1 . We additionally assume that the local input 
and a key-aggregation thereof—e.g. a hash table mapping keys to their local sums—fit into RAM at 
every PE. 

It is easy to see that except for the sampling process, the algorithms of Section 7 carry over 
directly, but a different approach is required in the analysis. 

8.1 Sampling 

Let s be the desired sample size, and define u aV g := y as the expected count required to yield a sample. 
When sampling an object (k,v), its expected sample count is thus . To retain constant time 
per object, we add I —— I samples directly, and one additional sample with probability — I —— I 

L l^avg J l^avg L l^avg J 

using a Bernoulli trial. We can again use a geometric distribution to reduce the number of calls to 
the random number generator. 

To improve accuracy and speed up exact counting, we aggregate the local input in a hash table, 
and sample the aggregate counts. This allows us to analyze the algorithms’ error independent of 
the input objects’ distribution. A direct consequence is that for each key and PE, the number of 
samples deviates from its expected value by at most 1, and the overall deviation per key |sj — E [s,]| 
is at most p. 

8.2 Probably Approximately Correct Algorithms 

Theorem 15 We can compute an (e, 5)-approximation of the top-k highest summing items in 
expected time O ~ \Jp l°g f + a log n^j . 

Sampling is done using local aggregation as described in Section 8.1. From then on, we proceed 
exactly as in Algorithm PAC from Section 7.1. 

Proof. The part of an element’s sample count that is actually determined by sampling is the sum 
of up to p Bernoulli trials X\,... ,X p with differing success probabilities. Therefore, its expected 
value p is the sum of the probabilities, and we can use Hoeffding’s inequality to bound the probability 
of significant deviations from this value. Let X := Jff—i X t . Then, 

2 

P[\X-p\ >t]< 2e~~ . (4) 

We now use this to bound the likelihood that an object has been very mis-sampled. Consider 
an element j with exact sum x (j ) and sample sum Sj. For some threshold A, consider the element 

1 This assumption is not strictly necessary, and the algorithm would still be communication efficient without 
it. However, making this assumption allows us to limit the number of samples transmitted by each PE as s/p for 
global sample size s. To violate this criterion, a small number of PEs would have to hold f2 {s/p) elements that are 
likely to yield at least one sample, making up for a large part of the global sample size without contributing to the 
result. In such a rare setting, we would incur up to a factor of p in communication volume and obtain running time 
o(^ + /?^v^iogT + « lo g n ). 
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mis-sampled if | x(j ) — Sjv aV g | > y, i.e. its estimated sum deviates from the true value by more 
than ~2 in either direction. Thus, we substitute t = into Equation (4) and bound the result 

by ^ to account for all elements. Solving for s = . we obtain s > ^ \J~2p\n 

In total, we require time f° r sampling, 0((3 ^ logp + alogp) for insertion (as no PE’s local 

sum exceeds 0(m/p), none can yield more than 0(s/p ) samples), and 0(/3 1 + alogn) for selection. 
We then obtain the claimed bound as sum of the components. □ 

As in Section 7, we can use exact summation to obtain a more precise answer. We do not go 
into details here, as the procedure is nearly identical. The main difference is that a lookup in the 
local aggregation result now suffices to obtain exact local sums without requiring consultation of the 
input. 

9 Data Redistribution 

Let rii denote the number of data objects present at PE i. Let n = rij. We want to redistribute 
the data such that afterwards each PE has at most n = \n/p] objects and such that PEs with more 
than n objects only send data (at most n* — h objects) and PEs with at most h objects only receive 
data (at most n — n* objects). We split the PEs into separately numbered groups of senders Sj and 
receivers dj. We also compute the deficit dp— n—rii on receivers and the surplus sp = n^-h on senders. 
Then we compute the prefix sums d and s of these sequences (i.e., dp— Ylj<i 4/ an d Si '-~ ^2j<i s j )• 
Effectively, d enumerates the empty slots able to receive objects and s enumerates the elements to be 
moved. Now we match receiving slots and elements to be moved by merging the sequences d and s. 
This is possible in time O(alogp) using Batcher’s parallel merging algorithm [7]. A subsequence of 
the form (di, Sj ,..., Sj+fc, dj+ a , • • •, 4+fe) indicates that sending PEs j ,..., j + k move their surplus 
to receiving PE i (where sending PE j + k only moves its items numbered Sj+k-i + L-4+a — !)• This 
is a gather operation. Sending PE j + k moves its remaining elements to receiving PEs i + a..i + b. 
This is a scatter operation. These segments of PE numbers can be determined using segmented 
prefix operations [8]. Overall, this can be implemented to run in time 0(/3maxjnj + alogp). Even 
though this operation cannot remove worst case bottlenecks, it can significantly reduce network 
traffic. 

10 Experiments 

We now present an experimental evaluation of the unsorted selection from Section 4.1 and the top-A; 
most frequent objects algorithms from Section 7. 

Experimental Setup. All algorithms were implemented in C++11 using OpenMPI 1.8 and 
Boost.MPI 1.59 for inter-process communication. Additionally, Intel’s Math Kernel Library in 
version 11.2 was used for random number generation. All code was compiled with the clang++ compiler 
in version 3.7 using optimization level -Of ast and instruction set specification -march=sandybridge. 
The experiments were conducted on InstitutsCluster II at Karlsruhe Institute of Technology, a 
distributed system consisting of 480 computation nodes, of which 128 were available to us. Each 
node is equipped with two Intel Xeon E5-2670 processors for a total of 16 cores with a nominal 
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Figure 6: Weak scaling plot for selecting the fc-th largest object, n/p = 2 28 , Zipf distribution. 


frequency of 2.6 GHz, and 64GiB of main memory. In total, 2048 cores were available to us. An 
Infiniband 4X QDR interconnect provides networking between the nodes. 

Methodology. We run weak scaling benchmarks, which show how wall-time develops for fixed 
per-PE problem size n/p as p increases. We consider p = 1 to 2048 PEs, doubling p in each step. 
Each PE is mapped to one physical core in the cluster. 

Zipf’s Law states that the frequency of an object from a multiset M with \M\ = n is inversely 
proportional to its rank among the objects ordered by frequency. Here, we consider the general case 
with exponent parameter s, i.e. Xi = n-jj-^-, where H n s = Y^=i i s the ra-th generalized harmonic 
number. 

10.1 Unsorted Selection 

Input Generation. We select values from the high tail of Zipf distributions. Per PE, we con¬ 
sider 2 24 , 2 26 , and 2 28 integer elements. To test with non-uniformly distributed data, the PE’s 
distribution parameters are randomized. The Zipf distributions comprise between 2 20 — 2 16 and 2 20 
elements, with each PE’s value chosen uniformly at random. Similarly, the exponent s is uniformly 
distributed between 1 and 1.2. This ensures that several PEs contribute to the result, so that the 
distribution is asymmetric, without the computation becoming a local operation at one PE, which 
has all of the largest elements. 
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We used several values of k, namely 1024, 2 20 , and 2 26 . We do not consider smaller values than 
1024, as for values this small, it would be more efficient to locally select the k largest (or smallest) 
elements, and run a simple distributed selection on those. 

Results. Figure 6 shows the results for selecting the k- th largest values from the input, for the 
above values of k. We can see that in most cases, the algorithm scales even better than the bounds 
lead us to expect—running time decreases as more PEs are added. This is especially prominent 
when selecting an element of high rank (k = 2 26 in Figure 6). The majority of the time is spent 
in partitioning, i.e. local work, dominating the time spent on communication. This underlines the 
effect of communication efficiency. 

10.2 Top-/c Most Frequent Objects 

As we could not find any competitors to compare our methods against, we use two naive centralized 
algorithm as baseline. The first algorithm, Naive, samples the input with the same probability as 
algorithm PAC, but instead of using a distributed hash table and distributed selection, each PE sends 
its aggregated local sample to a coordinator. The coordinator then uses quickselect to determine 
the elements of rank 1 ..k in the global sample, which it returns. Algorithm Naive Tree proceeds 
similarly, but uses a tree reduction to send the elements to the coordinator to reduce latency. Similar 
to Algorithm PAC' s hash table insertion operation, this reduction aggregates the counts in each 
step to keep communication volume low. 

Input Generation. We consider 2 24 , 2 26 and 2 28 elements per PE, which are generated according 
to different random distributions. First, we consider elements distributed according to Zipf’s Law 
with 2 20 possible values. These values are very concentrated and model word frequencies in natural 
languages, city population sizes, and many other rankings well [1,41], the most frequent element 
being j-times more frequent than that of rank j. Next, we study a negative binomial distribution 
with r = 1000 and success probability p = 0.05. This distribution has a rather wide plateau, resulting 
in the most frequent objects and their surrounding elements all being of very similar frequency. For 
simplicity, each PE generates objects according to the same distribution, as the distributed hash 
table into which the sample is inserted distributes elements randomly. Thus, tests with non-uniformly 
distributed data would not add substantially to the evaluation. 

Approximation Quality. To evaluate different accuracy levels, we consider the (e, (5) pairs 
(3 • 1CF 4 ,1CF 4 ) and (10 _6 ,1CF 8 ). This allows us to evaluate how running time develops under 
different accuracy requirements. 

We then select the k = 32 most frequent elements from the input according to the above 
requirements. We do not vary the parameter k here, as it has very little impact on overall performance. 
Instead, we refer to Section 10.1 for experiments on unsorted selection, which is the only subroutine 
affected by increasing k and shows no increase up to k = 2 20 . 

Results. Figure 7 shows the results for 2 28 elements per PE using e = 3 • 10 -4 and 5 = 1CF 4 . We 
can clearly see that Algorithm Naive does not scale beyond a single node at all (p > 16). In fact, its 
running time is directly proportional to p, which is consistent with the coordinator receiving p — 1 
messages—every other PE sends its key-aggregated sample to the coordinator. Algorithm Naive 
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Figure 7: Weak scaling plot for computing the 32 most frequent objects, e = 3 • 10 4 , 5 = 10 4 . EC 
suffers from constant overhead for exact counting. 



Number of PEs 

Figure 8: Weak scaling plot for computing the 32 most frequent objects, n/p = 2 28 , e = 10” 6 
5 = 10 -8 . Only EC can use sampling, other methods must consider all objects. 
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Tree fares better, and actually improves as more PEs are added. This is easily explained by the 
reduced sample size per PE as p increases, decreasing sampling time. However, communication time 
begins to dominate, as the decrease in overall running time is nowhere near as strong as the decrease 
in local sample size. This becomes clear when comparing it to Algorithm PAC , which outperforms 
Naive Tree for any number of PEs. We can see that it scales nearly perfectly—doubling the number 
of PEs (and thereby total input size) roughly halves running time. Since these three algorithms all 
use the same sampling rate, any differences in running time are completely due to time spent on 
communication. 

Lastly, let us consider Algorithm EC. In the beginning, it benefits from its much smaller sample 
size (see Section 7.2), but as p grows, the local work for exact counting dominates overall running 
time strongly and Algorithm EC is no longer competitive. Since local work remains constant with 
increasing p, we see nearly no change in overall running time. To see the benefits of Algorithm EC, 
we need to consider stricter accuracy requirements. 

In Figure 8, we consider e = ICE 6 and 6 = ICE * 8 . For Algorithms PAC, Naive, and Naive Tree, 
this requires considering the entire input for any number of PEs, as sample size is proportional to X, 
which the other terms cannot offset here. Conversely, Algorithm EC’s sample size depends only 
linearly on e, resulting in sample sizes orders of magnitude below those of the other algorithms. 

Again, we can see that Algorithm Naive is completely unscalable. Algorithm Naive Tree performs 
much better, with running times remaining roughly constant at around 6.5 seconds as soon as 
multiple nodes are used. Algorithm PAC suffers a similar fate, however it is slightly faster at 6.2 
seconds. This difference stems from reduced communication volume. However, both are dominated 
by the time spent on aggregating the input. Lastly, Algorithm EC is consistently fastest, requiring 
4.1 seconds, of which 3.7 seconds are spent on exact counting 2 . This clearly demonstrates that 
Algorithm EC is superior for small e. 

Smaller local input sizes do not yield significant differences, and preliminary experiments with 
elements distributed according to a negative binomial distribution proved unspectacular and of 
little informational value, as the aggregated samples have much fewer elements than in a Zipfian 
distribution—an easy case for selection. 

11 Conclusions 

We have demonstrated that a variety of top-/c selection problems can be solved in a communication 
efficient way, with respect to both communication volume and latencies. The basic methods are simple 
and versatile—the owner-computes rule, collective communication, and sampling. Considering the 
significant previous work on some of these problems, it is a bit surprising that such simple algorithms 
give improved results for such fundamental problems. However, it seems that the combination of 
communication efficiency and parallel scalability has been neglected for many problems. Our methods 
might have particular impact on applications where previous work has concentrated on methods 
with a pure master-worker scheme. 

It is therefore likely that our approach can also be applied to further important problems. For 
example, distributed streaming algorithms that generalize the centralized model of Yi and Zhang [40] 
seem very promising. The same holds for lower bounds, which so far have also neglected multiparty 
communication with point-to-point communication (see also [34]). 

2 When not all cores are used, memory bandwidth per core is higher. This allows faster exact counting for p = 1 to 

8 cores on a single node. 
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Closer to the problems considered here, there is also a number of interesting open questions. 
For the sorted selection problem from Section 4.2, it would be interesting to see whether there is a 
scalable parallel algorithm that makes an information theoretically optimal number of comparisons as 
in the sequential algorithm of Varman et al. [38]. Our analysis of approximate multiselection ignores 
the case where k — k = o(k). It can probably be shown to run in expected time 0(alogplog 
For the multicriteria top -k problem from Section 6, we could consider parallelization of advanced 
algorithms that scan less elements and perforin less random accesses, such as [6]. 

Regarding the top -k most frequent objects and sum aggregation, we expect to be able to conduct 
fully distributed monitoring queries without a substantial increase in communication volume over 
our one-shot algorithm. 
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(* select object with global rank k *) 

Procedure msSelect(s, k ) 

if = 1 then — base case 

return the only nonempty object 
select a pivot v uniformly at random 
find j such that s[l..j] < v and s[j + 1..] > v 
if Ei<i< P \j@i\ > k then 

return msSelect(s[l..j], k) 

else 

return msSelect(s[ji + 1..], A; - Yo<i< P \j @i \) 


Algorithm 9: Multisequence selection. 


A Multisequence Selection [ 2 ] 

Figure 9 gives high level pseudo code. The base case occurs if there is only a single object (and k = 1). 
We can also restrict the search to the first k objects of each local sequence. A random object is 
selected as a pivot. This can be done in parallel by choosing the same random number between 1 
and |s@i| on all PEs. Using a prefix sum over the sizes of the sequences, this object can be 
located easily in time (9 (a log p). Where ordinary quickselect has to partition the input doing linear 
work, we can exploit the sortedness of the sequences to obtain the same information in time O(logcr) 
with a := max* |s@i| by doing binary search in parallel on each PE. If items are evenly distributed, 
we have a = @(min(&:, ^)), and thus only time (9(logmin(/c, ^)) for the search, which partitions all 
the sequences into two parts. Deciding whether we have to continue searching in the left or the 
right parts needs a global reduction operations taking time (9 (a log p). As in ordinary quickselect, 
the expected depth of the recursion is (9(log£T \di\) = (9 (log min(kp, n)). We obtain the following 
result. 

Theorem 16 Algorithm 9 can be implemented to rim in expected time 
(9 ^ log p + log min • log min (kp, n) 

B Running Times of Existing Algorithms 

We now prove the running times given for previous works in Table 1. 

B.0.1 Unsorted Selection 

Previous algorithms, see [31], rely on randomly distributed input data or they explicitly redistribute 
the data. Thus, they have to move O(^) elements in the worst case. The remainder of the bound 
follows trivially. □ 


^ = (9 (a log 2 kp) 
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B.l Sorted Selection 


We could not find any prior results on distributed multiselection from sorted lists and thus list a 
sequential result by Varman et al. [38]. 

B.2 Bulk Parallel Priority Queue 

The result in [31] relies on randomly distributed input data. Therefore, in operation insert*, each 
PE needs to send its 0(k/p ) elements to random PEs, costing 0((a + /3)^) time. Then, operation 
deleteMin* is fairly straight-forward and mostly amounts to a selection. The deterministic parallel 
heap [13] needs to sort inserted elements and then they travel through log ^ levels of the data 
structure, which is allocated to different PEs. This alone means communication cost P (/?| log p). 

B.3 Heavy Hitters Monitoring 

Huang et al. give a randomized heavy hitters monitoring algorithm [19] that, for constant failure 
probability e, requires time + a ^ logn^ in our model. 

Proof. All communication is between the controller node and a monitor node, thus the maximum 
amount of communication is at the controller node. Each update, which consists of a constant 
number of words, is transmitted separately. Thus, the communication term given by the authors 
transfers directly into our model (except that the number of monitor nodes k is p — 1 here). □ 

B.4 Top-k Frequent Objects Monitoring 

Monitoring Query 1 of [3] performs top-A; most frequent object monitoring, for which it incurs a 
running time of P for relative error bound 7 = where A corresponds to e in their 

notation. This algorithm also has further restrictions: It does not provide approximate counts of the 
objects in the top -k set. It can only handle a small number N of distinct objects, all of which must 
be known in advance. It requires Q(Np) memory on the coordinator node, which is prohibitive if N 
and p are large. It must also be initialized with a top-A: set for the beginning of the streams, using 
an algorithm such as TA [15]. We now present a family of inputs for which the algorithm uses the 
claimed amount of communication. 

Initialize with N = k + 2 items 0 1 ,.. •, Oj ~ + 2 that all have the same frequency. Thus, the initial 
top-k set comprises an arbitrary k of these. Choose one of the two objects that are not in the top-A; 
set, and refer to this object as O s , and pick a peer (PE) Nf. Now, we send O s to Nf repeatedly, 
and all other items to all other peers in an evenly distributed manner (each peer receives around 
the same number of occurrences of each object). After at most 2A steps, the top-A; set has become 
invalid and an expensive full resolution step is required 3 . As we expect A to be on the large side, 
we choose Fq = 0 and Fj = for j > 0 as per the instructions in [3]. We can repeat this cycle 
to obtain an instance of the example family for any input size n. Note that the number of “cheap” 
resolution steps during this cycle depends on the choice of the Fj values, for which Babcock and 
Olston give rough guidelines of what might constitute a good choice, but do not present a theoretical 

3 This actually happens much earlier, the first “expensive” resolution is required after only O(A) steps. This number 
increases, but will never exceed A(ft + 1) _ * < 2A. 
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analysis of how it influences communication. Here, we ignore their cost and focus solely on the cost 
of “expensive” resolutions. 

By the above, an “expensive” resolution round is required every (at most) 0(pA) items in the 
input. Since the resolution set contains k + 1 objects (the top -k set plus the non-top -k object with 
a constraint violation), each “expensive” resolution has communication cost @((3kp + ap). Thus, 
we obtain a total communication cost for expensive resolutions of f l(^(/3kp + ap)) = £l(/3^ + a^), 
for a given relative error bound 7 . Additionally, each item requires at least constant time in local 
processing, accounting for the additive fl(^) term. The actual worst-case communication cost is likely 
even higher, but this example suffices to show that the approach of [3] is not communication-efficient 
in our model. □ 

B.5 Multicriteria Top-k 

Previous algorithms such as TPUT [10] or KLEE [25] are not directly comparable to our approach 
for a number of reasons. First, they do not support arbitrary numbers of processing elements, 
but limit p to the number of criteria m. Each PE is assigned one or more complete index lists, 
whereas our approach splits the objects among the PEs, storing the index lists of the locally present 
objects on each PE. This severely limits TPUT’s and KLEE’s scalability. Secondly, as noted in 
the introduction, these algorithms use a centralized master-worker approach, where the master (or 
coordinator) node handles all communication. This further limits scalability and leads to an inherent 
increase in communication volume by a factor of up to p. Thirdly, they are explicitly designed 
for wide-area networks (WANs), whereas our algorithms are designed with strongly interconnected 
PEs in mind, as they might be found in a data center. Since the modeling assumptions are too 
different to provide a meaningful comparison, we refrain from giving a communication analysis of 
these algorithms (nor was one provided in the original papers). 
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